AAAI.2020 - Humans and AI

Total: 16

#1 CoCoX: Generating Conceptual and Counterfactual Explanations via Fault-Lines

Authors: Arjun Akula ; Shuai Wang ; Song-Chun Zhu

We present CoCoX (short for Conceptual and Counterfactual Explanations), a model for explaining decisions made by a deep convolutional neural network (CNN). In Cognitive Psychology, the factors (or semantic-level features) that humans zoom in on when they imagine an alternative to a model prediction are often referred to as fault-lines. Motivated by this, our CoCoX model explains decisions made by a CNN using fault-lines. Specifically, given an input image I for which a CNN classification model M predicts class c_pred, our fault-line based explanation identifies the minimal semantic-level features (e.g., the stripes of a zebra, the pointed ears of a dog), referred to as explainable concepts, that need to be added to or deleted from I in order to alter the classification category of I by M to another specified class c_alt. We argue that, due to the conceptual and counterfactual nature of fault-lines, our CoCoX explanations are practical and more natural for both expert and non-expert users seeking to understand the internal workings of complex deep learning models. Extensive quantitative and qualitative experiments verify our hypotheses, showing that CoCoX significantly outperforms state-of-the-art explainable AI models. Our implementation is available at https://github.com/arjunakula/CoCoX.
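
The fault-line search itself is not spelled out in the abstract; the sketch below shows one plausible greedy formulation, where `predict`, `prob`, and `apply_edit` are hypothetical stand-ins for the paper's CNN and concept-editing machinery, not the authors' actual implementation.

```python
def greedy_fault_line(image, concepts, predict, prob, apply_edit, c_alt):
    """Greedily pick (concept, op) edits until the model predicts c_alt.

    predict(x) -> class label; prob(x, cls) -> P(cls | x);
    apply_edit(x, concept, op) -> edited input, with op in {"add", "delete"}.
    All three are hypothetical placeholders for the CNN under explanation.
    """
    edits, current = [], image
    while predict(current) != c_alt:
        candidates = [(c, op) for c in concepts for op in ("add", "delete")
                      if (c, op) not in edits]
        if not candidates:
            break  # no single concept edit left to try
        # keep the edit that most increases the target-class probability
        best = max(candidates, key=lambda e: prob(apply_edit(current, *e), c_alt))
        edits.append(best)
        current = apply_edit(current, *best)
    return edits  # a small, human-readable set of concept additions/deletions
```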

#2 Towards Awareness of Human Relational Strategies in Virtual Agents

Authors: Ian Beaver ; Cynthia Freeman ; Abdullah Mueen

As Intelligent Virtual Agents (IVAs) increase in adoption and further emulate human personalities, we are interested in how humans apply relational strategies to them compared to other humans in a service environment. Human-computer data from three live customer service IVAs was collected, and annotators marked all text that was deemed unnecessary to the determination of user intention, as well as the presence of multiple intents. After merging the selections of multiple annotators, a second round of annotation determined the classes of relational language present in the unnecessary sections, such as Greetings, Backstory, Justification, Gratitude, Rants, or Expressing Emotions. We compare this usage with relational language in human-human service interactions. We show that removing this language from task-based inputs has a positive effect, both increasing confidence and improving responses as evaluated by humans, demonstrating the need for IVAs to anticipate relational language injection. This work provides a methodology to identify relational segments and a baseline of human performance in this task, and lays the groundwork for IVAs to reciprocate relational strategies in order to improve their believability.
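
As a concrete illustration of removing relational language from task-based inputs, here is a minimal sketch that strips annotated relational spans from an utterance before intent classification; the `(start, end)` character-offset format is an assumption for illustration, not the paper's annotation scheme.

```python
def strip_relational(text, spans):
    """Remove annotated relational segments (greetings, backstory, rants, ...)
    from an utterance, keeping only the task-based content.

    spans: iterable of (start, end) character offsets marking relational text;
    this offset representation is assumed for the sketch.
    """
    keep, cursor = [], 0
    for start, end in sorted(spans):
        keep.append(text[cursor:start])
        cursor = max(cursor, end)
    keep.append(text[cursor:])
    return " ".join(part.strip() for part in keep if part.strip())

# e.g. strip_relational("Hi, my dog chewed my card, can I get a new one?",
#                       [(0, 4), (4, 27)]) -> "can I get a new one?"
```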

#3 Regression under Human Assistance

Authors: Abir De ; Paramita Koley ; Niloy Ganguly ; Manuel Gomez-Rodriguez

Decisions are increasingly taken by both humans and machine learning models. However, machine learning models are currently trained for full automation: they are not aware that some of the decisions may still be taken by humans. In this paper, we take a first step towards the development of machine learning models that are optimized to operate under different automation levels. More specifically, we first introduce the problem of ridge regression under human assistance and show that it is NP-hard. Then, we derive an alternative representation of the corresponding objective function as a difference of nondecreasing submodular functions. Building on this representation, we further show that the objective is nondecreasing and satisfies α-submodularity, a recently introduced notion of approximate submodularity. These properties allow a simple and efficient greedy algorithm to enjoy approximation guarantees for solving the problem. Experiments on synthetic and real-world data from two important applications, medical diagnosis and content moderation, demonstrate that the greedy algorithm beats several competitive baselines.
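
The greedy algorithm operates on the set of samples the machine handles; the following is a minimal sketch of that idea with plain ridge regression, under the simplifying assumption that the marginal gain is measured by refitting from scratch (the paper's α-submodular analysis and exact objective are not reproduced here).

```python
import numpy as np

def ridge_error(X, y, S, lam=1.0):
    """Squared training error of a ridge fit on the machine-handled set S;
    samples outside S are assumed to be deferred to a human."""
    if not S:
        return 0.0
    idx = list(S)
    Xs, ys = X[idx], y[idx]
    w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(X.shape[1]), Xs.T @ ys)
    return float(np.sum((Xs @ w - ys) ** 2))

def greedy_machine_set(X, y, k, lam=1.0):
    """Greedily grow the machine-handled set up to size k, at each step
    adding the sample whose inclusion keeps the objective smallest."""
    S, remaining = set(), set(range(len(y)))
    for _ in range(min(k, len(y))):
        best = min(remaining, key=lambda i: ridge_error(X, y, S | {i}, lam))
        S.add(best)
        remaining.remove(best)
    return S  # indices the machine should handle; the rest go to humans
```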

#4 MIMAMO Net: Integrating Micro- and Macro-Motion for Video Emotion Recognition

Authors: Didan Deng ; Zhaokang Chen ; Yuqian Zhou ; Bertram Shi

Spatial-temporal feature learning is of vital importance for video emotion recognition. Previous deep network structures often focused on macro-motion, which extends over long time scales, e.g., on the order of seconds. We believe that integrating structures capturing information about both micro- and macro-motion will benefit emotion prediction, because humans perceive both micro- and macro-expressions. In this paper, we propose to combine micro- and macro-motion features to improve video emotion recognition with a two-stream recurrent network, named MIMAMO (Micro-Macro-Motion) Net. Specifically, smaller and shorter micro-motions are analyzed by a two-stream network, while larger and more sustained macro-motions are captured by a subsequent recurrent network. Assigning specific interpretations to the roles of different parts of the network enables us to choose parameters based on prior knowledge, choices that turn out to be optimal. One of the important innovations in our model is the use of interframe phase differences rather than optical flow as input to the temporal stream. Compared with optical flow, phase differences require less computation and are more robust to illumination changes. Our proposed network achieves state-of-the-art performance on two video emotion datasets, the OMG emotion dataset and the Aff-Wild dataset. The most significant gains are for arousal prediction, for which motion information is intuitively more informative. Source code is available at https://github.com/wtomin/MIMAMO-Net.
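
To make the phase-difference input concrete: below is a small sketch with a single complex Gabor filter, whereas the actual model presumably uses a richer filter bank; the `gabor_kernel` parameters here are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=9, wavelength=4.0, theta=0.0):
    """One complex Gabor filter (illustrative; a full filter bank with
    several orientations and scales would be used in practice)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * (size / 4.0) ** 2))
    return envelope * np.exp(1j * 2 * np.pi * xr / wavelength)

def phase_difference(frame_a, frame_b, kernel):
    """Per-pixel phase change between two consecutive grayscale frames:
    cheaper than optical flow, and insensitive to a global scaling of
    brightness because phase ignores filter-response magnitude."""
    ra = convolve2d(frame_a, kernel, mode="same")
    rb = convolve2d(frame_b, kernel, mode="same")
    return np.angle(rb * np.conj(ra))  # wrapped into [-pi, pi]
```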

#5 Conditional Generative Neural Decoding with Structured CNN Feature Prediction

Authors: Changde Du ; Changying Du ; Lijie Huang ; Huiguang He

Decoding visual content from human brain activity is a challenging task of great scientific value. Two main factors hinder existing methods from producing satisfactory results: 1) the typically small amount of paired training data, and 2) under-exploitation of the structural information underlying the data. In this paper, we present a novel conditional deep generative neural decoding approach with structured intermediate feature prediction. Specifically, our approach first decodes the brain activity to the multilayer intermediate features of a pretrained convolutional neural network (CNN) with a structured multi-output regression (SMR) model, and then inverts the decoded CNN features to the visual images with an introspective conditional generation (ICG) model. The proposed SMR model can simultaneously leverage the covariance structures underlying the brain activities, the CNN features, and the prediction tasks to improve decoding accuracy and interpretability. Further, our ICG model can 1) leverage abundant unpaired images to augment the training data, 2) self-evaluate the quality of its conditionally generated images, and 3) adversarially improve itself without an extra discriminator. Experimental results show that our approach yields state-of-the-art visual reconstructions from brain activities.
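
A bare-bones view of the two-stage pipeline, with plain ridge regression standing in for the SMR model (the covariance-structured estimation is the paper's contribution and is not reproduced) and a hypothetical `generator` callable standing in for the ICG model:

```python
import numpy as np

def fit_feature_decoder(V, F, lam=1.0):
    """Stage 1 stand-in: map voxel activity V (n x d_v) to CNN features
    F (n x d_f) with ordinary ridge regression; SMR additionally models
    covariance across voxels, features, and prediction tasks."""
    return np.linalg.solve(V.T @ V + lam * np.eye(V.shape[1]), V.T @ F)

def decode_image(v_new, W, generator):
    """Stage 2 stand-in: invert predicted CNN features to an image with a
    conditional generator (the ICG model in the paper; here a placeholder)."""
    f_hat = v_new @ W        # predicted multilayer CNN features
    return generator(f_hat)  # hypothetical feature-to-image inversion
```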

#6 GaSPing for Utility

Authors: Mengyang Gu ; Debarun Bhattacharjya ; Dharmashankar Subramanian

High-consequence decisions often require a detailed investigation of a decision maker's preferences, as represented by a utility function. Inferring a decision maker's utility function through assessments typically involves an elicitation phase, where the decision maker responds to a series of elicitation queries, followed by an estimation phase, where the state of the art for direct elicitation approaches in practice is either to fit responses to a parametric form or to perform linear interpolation. We introduce a Bayesian nonparametric method involving Gaussian stochastic processes for estimating a utility function from direct elicitation responses. Advantages include the flexibility to fit a large class of functions, favorable theoretical properties, and a fully probabilistic view of the decision maker's preference properties, including risk attitude. Through extensive simulation experiments as well as two real datasets from management science, we demonstrate that the proposed approach results in better function fitting.
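
A minimal sketch of the estimation phase using scikit-learn's Gaussian process regressor; the RBF-plus-noise kernel and the toy elicitation responses below are assumptions for illustration, not the paper's specification.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy elicited (outcome, utility) pairs, e.g. from standard-gamble queries.
x = np.array([[0.0], [0.25], [0.5], [0.75], [1.0]])
u = np.array([0.0, 0.45, 0.70, 0.88, 1.0])

# GP posterior over the utility function: a flexible fit plus uncertainty,
# which a parametric form or linear interpolation would not provide.
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=0.3) + WhiteKernel(noise_level=1e-4),
    normalize_y=True)
gp.fit(x, u)

grid = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
mean, std = gp.predict(grid, return_std=True)  # estimate and its uncertainty
```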

#7 Harnessing GANs for Zero-Shot Learning of New Classes in Visual Speech Recognition

Authors: Yaman Kumar ; Dhruva Sahrawat ; Shubham Maheshwari ; Debanjan Mahata ; Amanda Stent ; Yifang Yin ; Rajiv Ratn Shah ; Roger Zimmermann

Visual Speech Recognition (VSR) is the process of recognizing or interpreting speech by watching the lip movements of the speaker. Recent machine-learning-based approaches model VSR as a classification problem; however, the scarcity of training data leads to error-prone systems with very low accuracy in predicting unseen classes. To solve this problem, we present a novel approach to zero-shot learning by generating new classes using Generative Adversarial Networks (GANs), and show how the addition of unseen class samples increases the accuracy of a VSR system by a significant margin of 27% and allows it to handle speaker-independent out-of-vocabulary phrases. We also show that our models are language agnostic and therefore capable of seamlessly generating, using English training data, videos for a new language (Hindi). To the best of our knowledge, this is the first work to show empirical evidence of the use of GANs for generating training samples of unseen classes in the domain of VSR, hence facilitating zero-shot learning. We make the added videos for new classes publicly available along with our code.
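
In outline, the augmentation step could look like the sketch below, where `gan.generate` and the `vsr` classifier interface are hypothetical stand-ins rather than the released code:

```python
def augment_and_train(vsr, gan, real_data, unseen_classes, n_per_class=100):
    """Zero-shot augmentation sketch: synthesize lip-movement clips for
    classes with no real training videos, then train the classifier on the
    union of real and generated data. `gan.generate(cls)` and `vsr.train`
    are assumed interfaces for illustration."""
    synthetic = [(gan.generate(cls), cls)
                 for cls in unseen_classes
                 for _ in range(n_per_class)]
    vsr.train(real_data + synthetic)  # classifier now covers unseen classes
    return vsr
```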

#8 Graph-Based Decoding Model for Functional Alignment of Unaligned fMRI Data

Authors: Weida Li ; Mingxia Liu ; Fang Chen ; Daoqiang Zhang

Aggregating multi-subject functional magnetic resonance imaging (fMRI) data is indispensable for generating valid and general inferences from patterns distributed across human brains. The disparities in anatomical structures and functional topographies of human brains warrant aligning fMRI data across subjects. However, existing functional alignment methods cannot adequately handle the variety of fMRI datasets available today, especially when they are not temporally aligned, i.e., some subjects may lack responses to some stimuli, or different subjects may follow different sequences of stimuli. In this paper, a cross-subject graph that depicts the (dis)similarities between samples across subjects is used as a prior for developing a more flexible framework that suits an assortment of fMRI datasets. However, the high dimensionality of fMRI data and the use of multiple subjects make this basic framework time-consuming or impractical. To address this issue, we further regularize the framework so that a novel, feasible kernel-based optimization, which permits non-linear feature extraction, can be developed theoretically. Specifically, a low-dimension assumption is imposed on each new feature space to avoid the overfitting caused by the high-spatial-low-temporal resolution of fMRI data. Experimental results on five datasets suggest that the proposed method is not only superior to several state-of-the-art methods on temporally aligned fMRI data, but also suitable for dealing with temporally unaligned fMRI data.
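
The cross-subject graph prior can be pictured with the toy construction below, which links samples from different subjects that share a stimulus label; the exact (dis)similarity weighting in the paper is more refined, so treat this as an assumed simplification.

```python
import numpy as np

def cross_subject_graph(labels, subjects):
    """Toy similarity graph over all samples pooled across subjects: connect
    pairs from different subjects that respond to the same stimulus, so no
    temporal alignment between subjects is required."""
    n = len(labels)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if subjects[i] != subjects[j] and labels[i] == labels[j]:
                W[i, j] = W[j, i] = 1.0
    return W
```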

#9 Multi-Source Domain Adaptation for Visual Sentiment Classification

Authors: Chuang Lin ; Sicheng Zhao ; Lei Meng ; Tat-Seng Chua

Existing domain adaptation methods for visual sentiment classification are typically investigated under the single-source scenario, where the knowledge learned from a source domain with sufficient labeled data is transferred to a target domain with loosely labeled or unlabeled data. However, in practice, data from a single source domain usually have limited volume and can hardly cover the characteristics of the target domain. In this paper, we propose a novel multi-source domain adaptation (MDA) method, termed Multi-source Sentiment Generative Adversarial Network (MSGAN), for visual sentiment classification. To handle data from multiple source domains, it learns to find a unified sentiment latent space where data from both the source and target domains share a similar distribution. This is achieved via cycle-consistent adversarial learning in an end-to-end manner. Extensive experiments conducted on four benchmark datasets demonstrate that MSGAN significantly outperforms state-of-the-art MDA approaches for visual sentiment classification.
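
A compressed sketch of one training step, assuming PyTorch modules `encoder`, `classifier`, and a domain `discriminator`; the cycle-consistency terms that give MSGAN its name are omitted, so this only shows the shared-latent-space adversarial part.

```python
import torch
import torch.nn.functional as F

def adaptation_step(encoder, classifier, discriminator, src_batches, tgt_images):
    """Encoder/classifier loss for one step: classify labeled source images
    while making features from every domain indistinguishable to the domain
    discriminator (whose own update is omitted here)."""
    feats, cls_loss = [], 0.0
    for images, labels in src_batches:      # one batch per source domain
        z = encoder(images)
        cls_loss = cls_loss + F.cross_entropy(classifier(z), labels)
        feats.append(z)
    feats.append(encoder(tgt_images))       # unlabeled target domain
    all_z = torch.cat(feats)
    domain = torch.cat([torch.full((z.size(0),), d, dtype=torch.long)
                        for d, z in enumerate(feats)])
    # negative sign: the encoder is rewarded for confusing the discriminator
    adv_loss = -F.cross_entropy(discriminator(all_z), domain)
    return cls_loss + adv_loss
```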

#10 Learning Graph Convolutional Network for Skeleton-Based Human Action Recognition by Neural Searching

Authors: Wei Peng ; Xiaopeng Hong ; Haoyu Chen ; Guoying Zhao

Human action recognition from skeleton data, fuelled by the Graph Convolutional Network (GCN) with its powerful capability of modeling non-Euclidean data, has attracted much attention. However, many existing GCNs use a pre-defined graph structure shared throughout the entire network, which can lose implicit joint correlations, especially for higher-level features. Besides, the mainstream spectral GCN is approximated by its first-order hop, so higher-order connections are not well captured. All of this demands great effort to design a better GCN architecture. To address these problems, we turn to Neural Architecture Search (NAS) and propose the first automatically designed GCN for this task. Specifically, we explore the spatial-temporal correlations between nodes and build a search space with multiple dynamic graph modules. Besides, we introduce multiple-hop modules, expecting them to break the limitation on representational capacity caused by the first-order approximation. Moreover, a corresponding sampling- and memory-efficient evolution strategy is proposed to search this space. The resulting architecture proves the effectiveness of the higher-order approximation and the layer-wise dynamic graph modules. To evaluate the performance of the searched model, we conduct extensive experiments on two very large-scale skeleton-based action recognition datasets. The results show that our model achieves state-of-the-art results in terms of the given metrics.
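
The multiple-hop idea can be illustrated with a single layer that aggregates powers of the normalized adjacency, so higher-order joint connections contribute directly; this is a generic multi-hop graph convolution, not the architecture the search actually discovers.

```python
import torch
import torch.nn as nn

class MultiHopGCNLayer(nn.Module):
    """Illustrative multi-hop graph convolution: sum contributions from
    A^0 (identity) through A^hops, each with its own linear transform."""

    def __init__(self, in_dim, out_dim, hops=3):
        super().__init__()
        self.hops = hops
        self.linears = nn.ModuleList(
            nn.Linear(in_dim, out_dim, bias=False) for _ in range(hops + 1))

    def forward(self, x, A):
        # x: (num_nodes, in_dim); A: (num_nodes, num_nodes), normalized
        out = self.linears[0](x)            # 0-hop: the node itself
        Ak = torch.eye(A.size(0), device=A.device)
        for k in range(1, self.hops + 1):
            Ak = Ak @ A                     # k-th power of the adjacency
            out = out + self.linears[k](Ak @ x)
        return torch.relu(out)
```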

#11 UCF-STAR: A Large Scale Still Image Dataset for Understanding Human Actions

Authors: Marjaneh Safaei ; Pooyan Balouchian ; Hassan Foroosh

Action recognition in still images poses a great challenge due to (i) the scarcity of training data and (ii) the absence of temporal information. To address the first challenge, we introduce a dataset for STill image Action Recognition (UCF-STAR), containing over 1M images across 50 different human body-motion action categories. UCF-STAR is the largest dataset in the literature for action recognition in still images. The key characteristics of UCF-STAR include (1) focusing on human body-motion rather than relatively static human-object interaction categories, (2) collecting images from the wild to benefit from a varied set of action representations, (3) appending multiple human-annotated labels per image rather than just the action label, and (4) inclusion of a rich, structured, and multi-modal set of metadata for each image. This departs from existing datasets, which typically provide a single annotation for a smaller number of images and categories, with no metadata. UCF-STAR exposes the intrinsic difficulty of action recognition through its realistic scene and action complexity. To benchmark and demonstrate the benefits of UCF-STAR as a large-scale dataset, and to show the role of “latent” motion information in recognizing human actions in still images, we present a novel approach relying on predicting temporal information, yielding higher accuracy on 5 widely-used datasets.

#12 Towards Socially Responsible AI: Cognitive Bias-Aware Multi-Objective Learning

Authors: Procheta Sen ; Debasis Ganguly

Human society has a long history of suffering from cognitive biases that lead to social prejudice and mass injustice. The prevalence of cognitive biases in large volumes of historical data poses the threat that they will be manifested as unethical and seemingly inhumane predictions by AI systems trained on such data. To alleviate this problem, we propose a bias-aware multi-objective learning framework that, given a set of identity attributes (e.g., gender, ethnicity) and a subset of sensitive categories among the possible prediction classes, learns to reduce the frequency of predicting certain combinations of them, e.g., predicting stereotypes such as ‘most blacks use abusive language’ or ‘fear is a virtue of women’. Our experiments, conducted on an emotion prediction task with balanced class priors, show that a set of baseline bias-agnostic models exhibits cognitive biases with respect to gender, such as predicting that women are prone to be afraid whereas men are more prone to be angry. In contrast, our proposed bias-aware multi-objective learning methodology is shown to reduce such biases in the predicted emotions.
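
One way to realize the two objectives is a task loss plus a penalty on the probability mass assigned to sensitive classes for identity-mentioning inputs; the sketch below is an assumed formulation of that trade-off, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def bias_aware_loss(logits, targets, identity_mask, sensitive_classes, lam=0.5):
    """Multi-objective sketch: standard cross-entropy plus a term that
    penalizes predicting sensitive classes (e.g. 'fear') for inputs that
    mention a protected identity attribute (e.g. 'women').

    identity_mask: (batch,) bool tensor; sensitive_classes: list of indices.
    """
    task = F.cross_entropy(logits, targets)
    if not identity_mask.any():
        return task
    probs = F.softmax(logits, dim=-1)
    # mean probability mass on sensitive classes for identity-marked inputs
    bias = probs[identity_mask][:, sensitive_classes].sum(dim=-1).mean()
    return task + lam * bias
```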

#13 Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Authors: Paul Hongsuck Seo ; Piyush Sharma ; Tomer Levinboim ; Bohyung Han ; Radu Soricut

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only outcome of an expensive human rating evaluation that gets used is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption ratings is several orders of magnitude less than the caption training data. We employ a policy gradient method to maximize human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated by samples from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally on a different, multi-dimensional side-by-side human evaluation procedure.
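
Stripped to its core, the off-policy objective weights caption log-likelihoods by (centered) human ratings; `model.caption_log_prob` is an assumed interface, and the importance-weighting corrections such a setting requires are left out of this sketch.

```python
import torch

def rated_caption_loss(model, images, captions, ratings):
    """REINFORCE-style sketch: captions come from the ratings dataset (not
    from sampling the current policy), and their human ratings act as
    rewards; centering gives poorly rated captions negative weight."""
    rewards = ratings - ratings.mean()                     # crude baseline
    log_probs = model.caption_log_prob(images, captions)   # assumed API
    return -(rewards * log_probs).mean()
```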

#14 Instance-Adaptive Graph for EEG Emotion Recognition

Authors: Tengfei Song ; Suyuan Liu ; Wenming Zheng ; Yuan Zong ; Zhen Cui

To tackle individual differences and characterize the dynamic relationships among different EEG regions for EEG emotion recognition, in this paper we propose a novel instance-adaptive graph method (IAG), which employs a more flexible way to construct graph connections, producing different graph representations for different input instances. To fit different EEG patterns, we employ an additional branch to characterize the intrinsic dynamic relationships between different EEG channels. To give a more precise graph representation, we design multi-level and multi-graph convolutional operations and graph coarsening. Furthermore, we present a type of sparse graph representation to extract more discriminative features. Experiments on two widely-used EEG emotion recognition datasets are conducted to evaluate the proposed model, and the experimental results show that our method achieves state-of-the-art performance.
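
The instance-adaptive part can be pictured as a small branch that predicts a channel-by-channel adjacency from the input features themselves; the sketch below is one plausible form of such a branch, not the paper's exact module.

```python
import torch
import torch.nn as nn

class InstanceAdaptiveGraph(nn.Module):
    """Predict a per-instance adjacency over EEG channels from the input,
    so each instance gets its own graph instead of one fixed topology."""

    def __init__(self, feat_dim, hidden=32):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)

    def forward(self, x):
        # x: (channels, feat_dim) features of one EEG instance
        h = self.proj(x)                      # (channels, hidden)
        logits = h @ h.t()                    # pairwise channel affinities
        return torch.softmax(logits, dim=-1)  # row-normalized adjacency
```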

#15 Variational Pathway Reasoning for EEG Emotion Recognition

Authors: Tong Zhang ; Zhen Cui ; Chunyan Xu ; Wenming Zheng ; Jian Yang

Research on human emotion cognition has revealed that connections and pathways exist between spatially adjacent and functionally related areas during emotion expression (Adolphs 2002a; Bullmore and Sporns 2009). Deeply inspired by this mechanism, we propose a heuristic Variational Pathway Reasoning (VPR) method for EEG-based emotion recognition. We introduce random walks to generate a large number of candidate pathways along electrodes. To encode each pathway, a dynamic sequence model is used to learn between-electrode dependencies. The encoded pathways around each electrode are aggregated to produce a pseudo maximum-energy pathway, which consists of the most important pair-wise connections. To find the most salient connections, we propose a sparse variational scaling (SVS) module that learns scaling factors of pseudo pathways using a Bayesian probabilistic process and a sparsity constraint, where the former endows good generalization ability while the latter favors adaptive pathway selection. Finally, the salient pathways among the candidates are jointly decided by the pseudo pathways and scaling factors. Extensive experiments on EEG emotion recognition demonstrate that the proposed VPR is superior to state-of-the-art methods and can find interesting pathways associated with different emotions.
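
The candidate-generation step can be illustrated by plain random walks over an electrode adjacency structure; the dictionary-of-neighbors representation below is an assumption made for the sketch.

```python
import random

def random_walk_pathways(neighbors, n_walks=1000, length=6, seed=0):
    """Generate candidate electrode pathways by random walks.

    neighbors: dict mapping each electrode to a non-empty list of adjacent
    electrodes (spatial adjacency on the EEG cap); assumed representation.
    """
    rng = random.Random(seed)
    electrodes = list(neighbors)
    walks = []
    for _ in range(n_walks):
        node = rng.choice(electrodes)
        walk = [node]
        for _ in range(length - 1):
            node = rng.choice(neighbors[node])  # step to a random neighbor
            walk.append(node)
        walks.append(walk)
    return walks  # candidates later encoded, aggregated, and scored by SVS
```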

#16 Crowd-Assisted Disaster Scene Assessment with Human-AI Interactive Attention

Authors: Daniel (Yue) Zhang ; Yifeng Huang ; Yang Zhang ; Dong Wang

The recent advances of mobile sensing and artificial intelligence (AI) have brought new revolutions to disaster response applications. One example is disaster scene assessment (DSA), which leverages computer vision techniques to assess the level of damage severity of disaster events from images provided by eyewitnesses on social media. The assessment results are critical for prioritizing the rescue operations of response teams. While AI algorithms can significantly reduce the detection time and manual labeling cost in such applications, their performance often falls short of the desired accuracy. Our work is motivated by the emergence of crowdsourcing platforms (e.g., Amazon Mechanical Turk, Waze) that provide unprecedented opportunities for acquiring human intelligence for AI applications. In this paper, we develop an interactive Disaster Scene Assessment (iDSA) scheme that allows AI algorithms to directly interact with humans to identify the salient regions of disaster images in DSA applications. We also develop new incentive designs and active learning techniques to ensure reliable, timely, and cost-efficient responses from the crowdsourcing platforms. Our evaluation results on real-world case studies of the Nepal and Ecuador earthquakes demonstrate that iDSA significantly outperforms state-of-the-art baselines in accurately assessing the damage of disaster scenes.
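
The interactive loop can be summarized as uncertainty-driven active learning against a crowd back end; `model.confidence`, `model.retrain`, and `crowd_label` below are hypothetical interfaces, and iDSA's incentive design and attention interaction are not modeled.

```python
def crowd_assisted_loop(model, unlabeled, crowd_label, budget, batch=10):
    """Send the images the model is least confident about to the crowd,
    retrain on the accumulated answers, and repeat until the query budget
    is exhausted. A simplified stand-in for the iDSA scheme."""
    labeled = []
    while budget > 0 and unlabeled:
        unlabeled.sort(key=lambda img: model.confidence(img))  # least sure first
        queries, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(img, crowd_label(img)) for img in queries]  # human answers
        model.retrain(labeled)
        budget -= len(queries)
    return model
```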